Greenville County
Evaluating Long-Context Reasoning in LLM-Based WebAgents
Chung, Andy, Zhang, Yichi, Lin, Kaixiang, Rawal, Aditya, Gao, Qiaozi, Chai, Joyce
As large language model (LLM)-based agents become increasingly integrated into daily digital interactions, their ability to reason across long interaction histories becomes crucial for providing personalized and contextually aware assistance. However, the performance of these agents in long context scenarios, particularly for action-taking WebAgents operating in realistic web environments, remains largely unexplored. This paper introduces a benchmark for evaluating long context reasoning capabilities of WebAgents through sequentially dependent subtasks that require retrieval and application of information from extended interaction histories. We develop a novel evaluation framework that simulates multi-session user interactions by injecting irrelevant task trajectories between dependent subtasks, creating contexts ranging from 25,000 to 150,000 tokens. Through extensive evaluation of four popular models, Claude-3.7, GPT-4.1, Llama 4, and o4-mini, we observe a dramatic performance degradation as context length increases, with success rates dropping from 40-50% in baseline conditions to less than 10% in long context scenarios. Our detailed error analysis reveals that agents primarily fail due to getting stuck in loops and losing track of original task objectives. We further propose an implicit RAG approach that provides modest improvements by generating task-relevant summaries, though fundamental limitations in long context reasoning persist. These findings highlight critical challenges for deploying WebAgents in realistic, long-term user interaction scenarios and provide insights for developing more robust agent architectures capable of maintaining coherent task execution across extended contexts.
- North America > The Bahamas (0.14)
- North America > United States > New York (0.04)
- North America > United States > Florida > Miami-Dade County > Miami (0.04)
- (11 more...)
- Workflow (0.93)
- Research Report > New Finding (0.93)
- Media (1.00)
- Consumer Products & Services (1.00)
- Transportation (0.93)
- Leisure & Entertainment > Sports > Basketball (0.46)
Beyond Awareness: Investigating How AI and Psychological Factors Shape Human Self-Confidence Calibration
Cau, Federico Maria, Spano, Lucio Davide
Human-AI collaboration outcomes depend strongly on human self-confidence calibration, which drives reliance or resistance toward AI's suggestions. This work presents two studies examining whether calibration of self-confidence before decision tasks, together with low versus high levels of Need for Cognition (NFC) and Actively Open-Minded Thinking (AOT), leads to differences in decision accuracy, self-confidence appropriateness during the tasks, and metacognitive perceptions (global and affective). The first study presents strategies to identify well-calibrated users, also comparing decision accuracy and the appropriateness of self-confidence across NFC and AOT levels. The second study investigates the effects of calibrated self-confidence in AI-assisted decision-making (no AI, two-stage AI, and personalized AI), also considering different NFC and AOT levels. Our results show the importance of human self-confidence calibration and psychological traits when designing AI-assisted decision systems. We further propose design recommendations to address the challenge of calibrating self-confidence and supporting tailored, user-centric AI that accounts for individual traits.
- Europe > United Kingdom > England > Oxfordshire > Oxford (0.14)
- Europe > Austria > Vienna (0.14)
- North America > United States > New York > New York County > New York City (0.05)
- (36 more...)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study > Negative Result (0.46)
Sim2Real Diffusion: Leveraging Foundation Vision Language Models for Adaptive Automated Driving
Samak, Chinmay Vilas, Samak, Tanmay Vilas, Li, Bing, Krovi, Venkat
Simulation-based design, optimization, and validation of autonomous vehicles have proven to be crucial for their improvement over the years. Nevertheless, the ultimate measure of effectiveness is their successful transition from simulation to reality (sim2real). However, existing sim2real transfer methods struggle to address the autonomy-oriented requirements of balancing: (i) conditioned domain adaptation, (ii) robust performance with limited examples, (iii) modularity in handling multiple domain representations, and (iv) real-time performance. To alleviate these pain points, we present a unified framework for learning cross-domain adaptive representations through conditional latent diffusion for sim2real transferable automated driving. Our framework offers options to leverage: (i) alternate foundation models, (ii) a few-shot fine-tuning pipeline, and (iii) textual as well as image prompts for mapping across given source and target domains. It is also capable of generating diverse high-quality samples when diffusing across parameter spaces such as times of day, weather conditions, seasons, and operational design domains. We systematically analyze the presented framework and report our findings in terms of performance benchmarks and ablation studies. Additionally, we demonstrate its serviceability for autonomous driving using behavioral cloning case studies. Our experiments indicate that the proposed framework is capable of bridging the perceptual sim2real gap by over 40%.
- Transportation > Ground > Road (1.00)
- Information Technology > Robotics & Automation (1.00)
- Automobiles & Trucks (1.00)
Experimental investigation of pose informed reinforcement learning for skid-steered visual navigation
Vision-based lane keeping is a topic of significant interest in the robotics and autonomous ground vehicles communities in various on-road and off-road applications. The skid-steered vehicle architecture has served as a useful vehicle platform for human-controlled operations. However, systematic modeling, especially of the skid-slip wheel terrain interactions (primarily in off-road settings), has created bottlenecks for automation deployment. End-to-end learning-based methods such as imitation learning and deep reinforcement learning have gained prominence as a viable deployment option to counter the lack of accurate analytical models. However, the systematic formulation and subsequent verification/validation in dynamic operation regimes (particularly for skid-steered vehicles) remains a work in progress. To this end, a novel approach for structured formulation for learning visual navigation is proposed and investigated in this work. Extensive software simulations, hardware evaluations and ablation studies highlight the significantly improved performance of the proposed approach against contemporary literature.
- Oceania > Australia > Queensland > Brisbane (0.04)
- North America > United States > South Carolina > Greenville County > Greenville (0.04)
- Automobiles & Trucks (1.00)
- Transportation > Ground > Road (0.67)
Re:Member: Emotional Question Generation from Personal Memories
Rackauckas, Zackary, Minematsu, Nobuaki, Hirschberg, Julia
We present Re:Member, a system that explores how emotionally expressive, memory-grounded interaction can support more engaging second language (L2) learning. By drawing on users' personal videos and generating stylized spoken questions in the target language, Re:Member is designed to encourage affective recall and conversational engagement. The system aligns emotional tone with visual context, using expressive speech styles such as whispers or late-night tones to evoke specific moods. It combines WhisperX-based transcript alignment, 3-frame visual sampling, and Style-BERT-VITS2 for emotional synthesis within a modular generation pipeline. Designed as a stylized interaction probe, Re:Member highlights the role of affect and personal media in learner-centered educational technologies.
- Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.06)
- North America > United States > Florida > Miami-Dade County > Miami (0.05)
- North America > United States > South Carolina > Greenville County > Greenville (0.04)
- (2 more...)
- Research Report (0.40)
- Instructional Material (0.34)
Identifying Aspects in Peer Reviews
Lu, Sheng, Kuznetsov, Ilia, Gurevych, Iryna
Peer review is central to academic publishing, but the growing volume of submissions is straining the process. This motivates the development of computational approaches to support peer review. While each review is tailored to a specific paper, reviewers often make assessments according to certain aspects such as Novelty, which reflect the values of the research community. This alignment creates opportunities for standardizing the reviewing process, improving quality control, and enabling computational support. While prior work has demonstrated the potential of aspect analysis for peer review assistance, the notion of aspect remains poorly formalized. Existing approaches often derive aspects from review forms and guidelines, yet data-driven methods for aspect identification are underexplored. To address this gap, our work takes a bottom-up approach: we propose an operational definition of aspect and develop a data-driven schema for deriving aspects from a corpus of peer reviews. We introduce a dataset of peer reviews augmented with aspects and show how it can be used for community-level review analysis. We further show how the choice of aspects can impact downstream applications, such as LLM-generated review detection. Our results lay a foundation for a principled and data-driven investigation of review aspects, and pave the path for new applications of NLP to support peer review.
- Europe > Austria > Vienna (0.14)
- North America > Canada > Ontario > Toronto (0.04)
- Europe > Germany > Hesse > Darmstadt Region > Darmstadt (0.04)
- (13 more...)
- Research Report > New Finding (1.00)
- Overview (1.00)
- Research Report > Experimental Study (0.92)
A Multimodal Human Protein Embeddings Database: DeepDrug Protein Embeddings Bank (DPEB)
Sajol, Md Saiful Islam, Rajasekaran, Magesh, Gemeinhardt, Hayden, Bess, Adam, Alvin, Chris, Mukhopadhyay, Supratik
Computationally predicting protein-protein interactions (PPIs) is challenging due to the lack of integrated, multimodal protein representations. DPEB is a curated collection of 22,043 human proteins that integrates four embedding types: structural (AlphaFold2), transformer-based sequence (BioEmbeddings), contextual amino acid patterns (ESM-2: Evolutionary Scale Modeling), and sequence-based n-gram statistics (ProtVec). AlphaFold2 protein structures are available through public databases (e.g., AlphaFold2 Protein Structure Database), but the internal neural network embeddings are not. DPEB addresses this gap by providing AlphaFold2-derived embeddings for computational modeling. Our benchmark evaluations show GraphSAGE with BioEmbedding achieved the highest PPI prediction performance (87.37% AUROC, 79.16% accuracy). The framework also achieved 77.42% accuracy for enzyme classification and 86.04% accuracy for protein family classification. DPEB supports multiple graph neural network methods for PPI prediction, enabling applications in systems biology, drug target identification, pathway analysis, and disease mechanism studies.
- North America > United States > Louisiana > East Baton Rouge Parish > Baton Rouge (0.04)
- North America > United States > South Carolina > Greenville County > Greenville (0.04)
- North America > United States > Nevada (0.04)
- Health & Medicine > Pharmaceuticals & Biotechnology (1.00)
- Education > Health & Safety > School Nutrition (0.35)
"I think this is fair": Uncovering the Complexities of Stakeholder Decision-Making in AI Fairness Assessment
Luo, Lin, Nakao, Yuri, Chollet, Mathieu, Inakoshi, Hiroya, Stumpf, Simone
Assessing fairness in artificial intelligence (AI) typically involves AI experts who select protected features, fairness metrics, and set fairness thresholds. However, little is known about how stakeholders, particularly those affected by AI outcomes but lacking AI expertise, assess fairness. To address this gap, we conducted a qualitative study with 30 stakeholders without AI expertise, representing potential decision subjects in a credit rating scenario, to examine how they assess fairness when placed in the role of deciding on features with priority, metrics, and thresholds. We reveal that stakeholders' fairness decisions are more complex than typical AI expert practices: they considered features far beyond legally protected features, tailored metrics for specific contexts, set diverse yet stricter fairness thresholds, and even preferred designing customized fairness. Our results extend the understanding of how stakeholders can meaningfully contribute to AI fairness governance and mitigation, underscoring the importance of incorporating stakeholders' nuanced fairness judgments.
- Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.14)
- Europe > United Kingdom > Scotland > City of Glasgow > Glasgow (0.14)
- North America > United States > Georgia > Fulton County > Atlanta (0.14)
- (25 more...)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (1.00)
- Questionnaire & Opinion Survey (1.00)
- Law (1.00)
- Banking & Finance > Credit (0.67)
- Education > Educational Setting > Higher Education (0.46)
- Government > Regional Government > Europe Government (0.46)
- Information Technology > Data Science > Data Mining (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning (1.00)
- Information Technology > Artificial Intelligence > Issues > Social & Ethical Issues (1.00)
Where Do I 'Add the Egg'?: Exploring Agency and Ownership in AI Creative Co-Writing Systems
Carrera, Dashiel, Thomas-Mitchell, Jeb, Wigdor, Daniel
AI co-writing systems challenge long-held ideals about agency and ownership in the creative process, thereby hindering widespread adoption. In order to address this, we investigate conceptions of agency and ownership in AI creative co-writing. Drawing on insights from a review of commercial systems, we developed three co-writing systems with identical functionality but distinct interface metaphors: agentic, tool-like, and magical. Through interviews with professional and non-professional writers (n = 18), we explored how these metaphors influenced participants' sense of control and authorship. Our analysis resulted in a taxonomy of agency and ownership subtypes and underscores how tool-like metaphors shift writers' expected points of control while agentic metaphors foreground conceptual contributions. We argue that interface metaphors not only guide expectations of control but also frame conceptions of authorship. We conclude with recommendations for the design of AI co-writing systems, emphasizing how metaphor shapes user experience and creative practice.
- North America > Canada > Ontario > Toronto (0.15)
- North America > United States > New York > New York County > New York City (0.05)
- North America > United States > Hawaii > Honolulu County > Honolulu (0.05)
- (14 more...)
- Research Report > New Finding (1.00)
- Personal > Interview (0.67)
ClearFairy: Capturing Creative Workflows through Decision Structuring, In-Situ Questioning, and Rationale Inference
Son, Kihoon, Choi, DaEun, Kim, Tae Soo, Kim, Young-Ho, Yun, Sangdoo, Kim, Juho
Capturing professionals' decision-making in creative workflows is essential for reflection, collaboration, and knowledge sharing, yet existing methods often leave rationales incomplete and implicit decisions hidden. To address this, we present the CLEAR framework, which structures reasoning into cognitive decision steps: linked units of actions, artifacts, and self-explanations that make decisions traceable. Building on this framework, we introduce ClearFairy, a think-aloud AI assistant for UI design that detects weak explanations, asks lightweight clarifying questions, and infers missing rationales to ease the knowledge-sharing burden. In a study with twelve creative professionals, 85% of ClearFairy's inferred rationales were accepted, increasing strong explanations from 14% to over 83% of decision steps without adding cognitive demand. The captured steps also enhanced generative AI agents in Figma, yielding next-action predictions better aligned with professionals and producing more coherent design outcomes. For future research on human knowledge-grounded creative AI agents, we release a dataset of 417 captured decision steps.
- North America > United States > New York > New York County > New York City (0.15)
- North America > United States > California > San Francisco County > San Francisco (0.14)
- North America > United States > Georgia > Fulton County > Atlanta (0.14)
- (22 more...)
- Workflow (1.00)
- Research Report > Experimental Study (0.68)
- Research Report > New Finding (0.68)
- Personal > Interview (0.45)
- Education (1.00)
- Health & Medicine > Therapeutic Area (0.45)